[WIP] {2023.06}[foss/2023a] PyTorch v2.1.2 with CUDA/12.1.1 #586
…-layer into 2023.06-software.eessi.io-cuDNN-8.9.2.26-system
- `EESSI-install-software.sh`
  - use `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh` with
    `scripts/gpu_support/nvidia/eessi-2023.06-cuda-and-libraries.yml`
- `create_lmodsitepackage.py`
  - consolidate the `eessi_{cuda,cudnn}_enabled_load_hook` functions into a single one
    (`eessi_cuda_and_libraries_enabled_load_hook`)
  - the remaining hook is prepared so that new modules, e.g., cuTENSOR, can easily be added
- `eb_hooks.py`
  - move the code that iterates over all files, replacing non-distributable ones with
    symlinks into `host_injections`, into a common function
    (`replace_non_distributable_files_with_symlinks`)
- `install_scripts.sh`
  - add files to copy to CVMFS (see `nvidia_files`)
- `scripts/gpu_support/nvidia/install_cuda_and_libraries.sh`
  - improved creation of tmp directory
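
The consolidation described for `eb_hooks.py` can be pictured with a minimal sketch. It only illustrates the idea of walking an installation directory and swapping files that may not be redistributed for symlinks into `host_injections`. The signature, the `is_distributable` predicate, and the directory layout are assumptions made for this example and are not taken from the actual `eb_hooks.py`.

```python
import os


def replace_non_distributable_files_with_symlinks(install_dir, host_injections_dir,
                                                   is_distributable):
    """Replace non-distributable files below install_dir with symlinks.

    Hypothetical signature (assumption): is_distributable is a callable that
    takes a path relative to install_dir and returns True if the file may be
    shipped as-is. Each replaced file becomes a symlink to the same relative
    path below host_injections_dir. Returns the sorted list of replaced paths.
    """
    replaced = []
    for dirpath, _dirnames, filenames in os.walk(install_dir):
        for filename in filenames:
            full_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(full_path, install_dir)
            # Leave existing symlinks and distributable files untouched
            if os.path.islink(full_path) or is_distributable(rel_path):
                continue
            target = os.path.join(host_injections_dir, rel_path)
            os.remove(full_path)
            # The target may not exist yet; a dangling symlink is fine here
            os.symlink(target, full_path)
            replaced.append(rel_path)
    return sorted(replaced)
```

Passing the predicate as an argument keeps the sketch independent of any particular allowlist, which is how one function could serve CUDA, cuDNN, and later libraries alike.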
| We run a first attempt without any modifications (e.g., to work around issues). `bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2` | 
| Building after applying the changes provided by #579. `bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2` | 
| Trying again... `bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2` | 
| Commented out code (in …). `bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2` | 
| Added … `bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2` | 
| Try again... `bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2` | 
| Now, try building for multiple compute capabilities (…). `bot: build inst:aws repo:eessi.io-2023.06-software arch:zen2` | 
| This is currently not being actively worked on: we first need to rework/implement support for building for GPUs, which in turn depends on support for dev.eessi.io. | 
| @trz42 Can you retarget this PR, and also give some pointers on what it is stuck behind or what should be worked on? | 
| Superseded by #973 | 
WORK IN PROGRESS
Eventually, this PR aims to add PyTorch/2.1.2 with CUDA/12.1.1. However, building it may not work out of the box, so this PR documents the progress, the issues we hit, and the workarounds applied.
PyTorch with CUDA requires cuDNN, so this PR also builds cuDNN using the changes provided by #581 and #579. The changes from #579 would have to be ingested first, so we need additional changes here; we try to document clearly what we do and why.
Initially, we only build for compute capability `7.0`. Later, we build for architectures from `Pascal` onwards, excluding architectures for embedded GPUs and very special compute capabilities such as `9.0a`. That is, the list of compute capabilities could be `6.0,6.1,7.0,7.5,8.0,8.6,8.9,9.0`.
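
To illustrate the selection rule, the following sketch filters a candidate list of compute capabilities down to the eight values named above. The candidate list and the set of embedded-GPU capabilities (`5.3`, `6.2`, `7.2`, `8.7`) are assumptions added for this example. The comma-separated form is what EasyBuild's `--cuda-compute-capabilities` option takes, and the semicolon-separated form is what PyTorch's `TORCH_CUDA_ARCH_LIST` environment variable accepts; whether the build in this PR sets either of them this way is not confirmed here.

```python
# Candidate compute capabilities (assumed for illustration, not exhaustive)
CANDIDATES = ["5.0", "5.2", "5.3", "6.0", "6.1", "6.2", "7.0", "7.2", "7.5",
              "8.0", "8.6", "8.7", "8.9", "9.0", "9.0a"]
# Compute capabilities of embedded (Jetson-class) GPUs (assumption for this sketch)
EMBEDDED = {"5.3", "6.2", "7.2", "8.7"}
# Pascal is the first architecture generation to include (major version 6)
PASCAL_MAJOR = 6


def select_compute_capabilities(candidates):
    """Keep Pascal and newer, drop embedded GPUs and suffixed variants like 9.0a."""
    selected = []
    for cc in candidates:
        major, minor = cc.split(".")
        if not minor.isdigit():
            # Special variants such as "9.0a" carry a letter suffix
            continue
        if int(major) < PASCAL_MAJOR:
            continue
        if cc in EMBEDDED:
            continue
        selected.append(cc)
    return selected


ccs = select_compute_capabilities(CANDIDATES)
easybuild_value = ",".join(ccs)   # e.g. for --cuda-compute-capabilities
torch_arch_list = ";".join(ccs)   # e.g. for TORCH_CUDA_ARCH_LIST
print(easybuild_value)
```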